
Conversation

Collaborator

@Kangyan-Zhou Kangyan-Zhou commented Nov 1, 2025

Add an option to deploy the model agent only to GPU nodes, so that we don't waste resources when there is no CPU serving.

Add option to use Ubuntu 22.04 instead of the default Oracle Linux.

Add cache for build to reduce build time across the board, especially xnet.
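The caching and base-image changes described above might look like the following Dockerfile sketch. This is illustrative only: the build path ./cmd/model-agent, the Go version, the default base image tag, and the installed packages are assumptions rather than details taken from this PR; only the BASE_IMAGE build argument and the BuildKit cache-mount technique come from the description.

```dockerfile
# Sketch of the caching pattern; requires Docker BuildKit.
# BASE_IMAGE comes from the PR; its default tag here is an assumption.
ARG BASE_IMAGE=oraclelinux:9

FROM golang:1.22 AS builder
WORKDIR /src
COPY go.mod go.sum ./
# Cache mounts persist the Go module and build caches across builds,
# so unchanged dependencies are not re-downloaded or re-compiled.
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -o /out/model-agent ./cmd/model-agent

FROM ${BASE_IMAGE}
# The two supported bases use different package managers.
RUN if command -v microdnf >/dev/null 2>&1; then \
        microdnf install -y ca-certificates && microdnf clean all; \
    else \
        apt-get update \
        && apt-get install -y --no-install-recommends ca-certificates \
        && rm -rf /var/lib/apt/lists/*; \
    fi
COPY --from=builder /out/model-agent /usr/local/bin/model-agent
ENTRYPOINT ["/usr/local/bin/model-agent"]
```

Building with `--build-arg BASE_IMAGE=ubuntu:22.04` selects the Ubuntu base; omitting it keeps the Oracle Linux default.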

@gemini-code-assist
Contributor

Summary of Changes

Hello @Kangyan-Zhou, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new capability to deploy the ome model agent specifically onto Kubernetes nodes equipped with GPUs, providing finer control over deployment targets. Simultaneously, it significantly optimizes the build process for various components by integrating advanced caching mechanisms for Go and Docker, which will lead to faster and more efficient builds. Additionally, it enhances the flexibility of Dockerfiles by allowing configurable base images and improves error reporting in a utility function.

Highlights

  • GPU Node Deployment for Model Agent: Introduced a new option to deploy the model agent exclusively to GPU-enabled Kubernetes nodes, allowing for more precise resource allocation. This is controlled via modelAgent.gpuNodesOnly and modelAgent.gpuNodeLabel in values.yaml.
  • Comprehensive Build Caching Improvements: Implemented significant enhancements to Go and Docker builds across multiple components by leveraging Docker BuildKit's cache mounts for Go modules and build caches. This includes updates to .gitignore, Makefile, and all relevant Dockerfiles.
  • Dockerfile Base Image Flexibility: The model-agent and ome-agent Dockerfiles now support a configurable base image (BASE_IMAGE build argument) and include conditional logic to adapt package installations for both microdnf (Oracle Linux) and apt-get (Ubuntu/Debian) based systems.
  • Enhanced Error Reporting: Improved the SnapshotDownload function to collect and report detailed error messages alongside file paths when errors occur during the download process, aiding in debugging.
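For illustration, the GPU-only knob from the first highlight might be wired up as follows. The key names modelAgent.gpuNodesOnly and modelAgent.gpuNodeLabel come from this PR, but the default label value and the DaemonSet template wiring shown here are assumptions:

```yaml
# values.yaml (sketch)
modelAgent:
  gpuNodesOnly: true                       # deploy the model agent only to GPU nodes
  gpuNodeLabel: "nvidia.com/gpu.present"   # assumed default; set to your cluster's GPU label

# templates/model-agent-daemonset.yaml (sketch of the conditional selector)
#   {{- if .Values.modelAgent.gpuNodesOnly }}
#   nodeSelector:
#     {{ .Values.modelAgent.gpuNodeLabel }}: "true"
#   {{- end }}
```

With gpuNodesOnly disabled, no nodeSelector is rendered and the agent schedules onto all nodes as before.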
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Kangyan-Zhou Kangyan-Zhou changed the title Ome agent gpu Make OME Deployment Flexible with GPU; Improve Build Cache Nov 1, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two main improvements: flexible deployment of the model agent to GPU nodes and enhanced build caching for Docker and Go. The changes to the Helm charts add valuable flexibility for GPU-specific deployments, and the Dockerfile and Makefile modifications significantly improve build times by leveraging caching. I've flagged the Helm template hub-override logic for a closer look, along with a couple of medium-severity suggestions for the Makefile and error handling.

      containers:
        - name: model-agent
-         image: {{ include "ome.imageWithHub" (dict "values" .Values "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}
+         image: {{ include "ome.imageWithHub" (dict "values" (merge (dict "global" (dict "hub" (default .Values.global.hub .Values.modelAgent.image.hub))) .Values) "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}

critical

Careful with the Sprig semantics here:

  1. default takes its fallback first (default <fallback> <value>), so (default .Values.global.hub .Values.modelAgent.image.hub) already resolves to the model-agent hub when it is set, falling back to the global hub otherwise, as documented in values.yaml.
  2. merge gives precedence to the left-most (dest) dictionary, so the override dict must come before .Values; writing merge .Values (dict ...) would let the existing .Values.global.hub win and discard the override.

As written, the new line therefore behaves as intended; swapping either pair of arguments would break the model-agent hub override.

export GOMODCACHE

# Ensure cache directories exist
$(shell mkdir -p $(GOCACHE) $(GOMODCACHE))

medium

The $(shell mkdir -p ...) command is executed every time make is invoked, during the parsing phase. While mkdir -p is idempotent, this is generally not a good practice as it can have side effects and is not explicit. It's better to create directories as part of a target's recipe or as a prerequisite for targets that need them. For example, you could create a .PHONY setup target and add it as a dependency to your build targets.
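Concretely, the suggested pattern could look like the following Makefile sketch. The order-only prerequisites (to the right of the |) make the directories a precondition of build without triggering rebuilds when their timestamps change; variable and target names here are illustrative:

```makefile
GOCACHE    ?= $(CURDIR)/.cache/go-build
GOMODCACHE ?= $(CURDIR)/.cache/go-mod
export GOCACHE GOMODCACHE

# Created only when absent, not on every make invocation.
$(GOCACHE) $(GOMODCACHE):
	mkdir -p $@

# Order-only prerequisites: build depends on the directories existing,
# but not on their modification times.
build: | $(GOCACHE) $(GOMODCACHE)
	go build ./...
```

This keeps directory creation explicit in the dependency graph instead of running as a side effect of parsing.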

if totalErrors > 0 {
// Collect error details
var errorFiles []string
var errorMessages []string

medium

You've added errorMessages to collect more detailed errors, which is a great improvement for debuggability. However, this new slice is populated but never used later in the function. The function still prints the less informative errorFiles slice and returns a generic error message. Consider using errorMessages to provide more detailed feedback to the user on failure, for example by joining them into the returned error string or logging them.

@Kangyan-Zhou Kangyan-Zhou changed the title Make OME Deployment Flexible with GPU; Improve Build Cache Make OME Deployment Flexible with GPU only option; Improve Build Cache and Docker Build Nov 1, 2025